Data Profiling Using Attribute Clustering
Abstract
Finding trends in database data is hard when the data sets contain many attributes (columns). The difficulty increases when the data is held in text fields, which may include large summary or remarks fields. This paper discusses an approach that uses attribute-level clustering to discover trends, or profiles, in the data. It differs from traditional uses of clustering in that each attribute is clustered separately and the results are then combined to define profiles. For example, the Global Terrorism Database (GTD) data set used in a case study has 98 columns (attributes). A profile might be defined by a particular group, attack type, and weapon type, together with specific information found in larger remarks-type fields. Each profile shows the values of these attributes along with all the records that match it.
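The idea of clustering each attribute separately and then combining the results can be illustrated with a minimal sketch. The abstract does not say which clustering algorithm the paper uses, so the greedy token-overlap (Jaccard) clustering below is only an assumed stand-in, and the function names, the threshold, and the sample rows are all hypothetical.

```python
from collections import defaultdict


def tokens(value):
    """Lowercase a field value and split it into a set of word tokens."""
    return set(str(value).lower().split())


def jaccard(a, b):
    """Jaccard similarity between two token sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_attribute(values, threshold=0.5):
    """Greedy single-pass clustering of one column (an assumed stand-in method).

    Each value joins the first cluster whose representative (the first value
    placed in it) is at least `threshold` similar; otherwise it starts a new
    cluster. Returns one cluster id per input value.
    """
    representatives = []
    labels = []
    for value in values:
        toks = tokens(value)
        for cluster_id, rep in enumerate(representatives):
            if jaccard(toks, rep) >= threshold:
                labels.append(cluster_id)
                break
        else:
            representatives.append(toks)
            labels.append(len(representatives) - 1)
    return labels


def build_profiles(records, attributes, threshold=0.5):
    """Cluster each attribute on its own, then group records by the
    combination of their per-attribute cluster ids."""
    per_attribute = {
        attr: cluster_attribute([r[attr] for r in records], threshold)
        for attr in attributes
    }
    profiles = defaultdict(list)
    for i in range(len(records)):
        key = tuple(per_attribute[attr][i] for attr in attributes)
        profiles[key].append(i)
    return dict(profiles)


# Hypothetical rows: field names and values are invented for illustration
# and are not taken from the GTD.
records = [
    {"group": "Group A", "attack_type": "Bombing",
     "remarks": "device left near market"},
    {"group": "group a", "attack_type": "bombing",
     "remarks": "device left near a market"},
    {"group": "Group B", "attack_type": "Armed Assault",
     "remarks": "gunmen attacked checkpoint"},
]
profiles = build_profiles(records, ["group", "attack_type", "remarks"])
```

Each key in `profiles` is one combination of per-attribute cluster ids, and its value lists the indices of the records that match that profile. Because every column is clustered independently, a long remarks field and a short categorical field can each be grouped on their own terms before they are combined.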
Similar resources
A Fuzzy C-means Algorithm for Clustering Fuzzy Data and Its Application in Clustering Incomplete Data
The fuzzy c-means clustering algorithm is a useful tool for clustering; but it is convenient only for crisp complete data. In this article, an enhancement of the algorithm is proposed which is suitable for clustering trapezoidal fuzzy data. A linear ranking function is used to define a distance for trapezoidal fuzzy data. Then, as an application, a method based on the proposed algorithm is pres...
Knowledge discovery from patients’ behavior via clustering-classification algorithms based on weighted eRFM and CLV model: An empirical study in public health care services
The rapid growth of information technology (IT) motivates and creates competitive advantages in the health care industry. Nowadays, many hospitals try to build successful customer relationship management (CRM) to recognize target and potential patients, increase patient loyalty and satisfaction, and finally maximize their profitability. Many hospitals have large data warehouses containing customer ...
Genetic Relationships among Three Yarrow Species Based on Phenotypic Traits and Peroxidase Profiling
Fifteen yarrow populations from the species Achillea millefolium L., A. biebersteinii L. and A. nobilis, from different geographical areas of Iran, were studied using 24 morphological traits and peroxidase profiles. Comparison of mean values of different phenotypic traits shows that A. millefolium and A. biebersteinii L. had higher plant height and crown diameter; however, A. nobilis had higher ...
Clustering Categorical Data Based on Combinations of Attribute Values
Clustering is an important technique for exploratory data analysis. While most of the earlier clustering algorithms focused on numerical data, real-world problems and data mining applications frequently involve categorical data. Here, we propose a new clustering algorithm for categorical data that is based on the frequency of attribute value combinations. Our algorithm finds all the combination...